In this exercise we use Boston data from MASS-library. This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. Data includes 14 variables and 506 rows
# access the MASS package and load other libraries for later analysis
library(MASS)
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# load the data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
#Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
There are some very interesting distributions fo variables in the plot matrix. Variable rad has high and low values so the plot shows that the values are concentrated on either side of the plot.
#Correlation matrix#
#calculating the correlation matrix and correlation plot
cor_matrix <- round(cor(Boston),digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Plotted correlation matrix shows that there is some high correlation between variables: * Correlation is quite clear between industrial areas (indus) and nitrogen oxides (nox). Industry adds pollution in the area. Industry seems to correlate also with variablrs like age, dis, ras and tax. * Nitrogen oxides (nox) correlations are very similar with industry (indus) * Crime rate (crim) seems to correlate with good accessibilitty to radial highways (rad) and value property (tax). * Old houses (age) and employment centers have also something common
#Scaled data# All the variables are numerical so we can use scale()-function to scale whole data set
#Standardize the dataset and print out summaries of the scaled data. How did the variables change?
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
Scaling the data makes variables look as if they are in the same range. Variables like black and tax were before scaling hundred fold compared to some other variables
#Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset.
#save the scaled crim as scaled_crim
scaled_crim <- boston_scaled$crim
#create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
#look at the table of the new factor crime, do not change the actual variable "crime"
crime_tab <-table(crime)
crime_tab
## crime
## low med_low med_high high
## 127 126 126 127
#remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
#Train and test set# Training set contains 80% of the data. 20% is in the test set
#Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
ind
## [1] 316 95 343 16 386 397 395 480 430 222 151 263 448 301 81 57 481 204
## [19] 10 335 417 490 147 366 193 66 186 139 224 265 103 484 438 79 453 87
## [37] 156 268 350 230 294 179 124 344 254 429 97 362 248 80 223 208 72 49
## [55] 367 455 357 405 359 428 412 260 497 229 345 479 443 129 169 495 37 134
## [73] 173 449 201 440 46 394 219 214 411 280 282 88 458 425 419 226 500 473
## [91] 437 238 122 450 291 390 308 289 471 295 69 463 433 211 77 123 182 131
## [109] 113 333 162 243 58 299 486 112 393 418 326 403 47 19 361 18 304 459
## [127] 498 253 423 478 63 355 174 255 126 23 121 101 330 353 324 273 485 43
## [145] 468 34 93 172 109 476 457 144 163 55 271 451 3 317 492 431 256 474
## [163] 409 65 125 199 118 198 149 445 12 297 212 432 90 210 110 491 61 85
## [181] 100 311 161 258 111 346 167 178 227 98 74 175 36 70 402 78 257 105
## [199] 272 462 447 358 332 33 127 108 120 242 166 392 187 290 259 234 96 336
## [217] 128 300 143 482 483 396 328 348 373 374 385 506 1 310 388 247 15 415
## [235] 94 377 31 11 154 26 30 39 307 48 442 28 309 41 303 364 249 414
## [253] 465 4 283 177 488 35 155 209 7 142 399 284 146 250 413 135 221 82
## [271] 427 460 446 17 499 338 325 237 331 376 281 341 241 14 354 276 305 387
## [289] 240 189 436 73 424 381 329 278 153 75 318 466 215 505 24 232 119 92
## [307] 233 384 53 44 71 494 176 426 183 165 391 288 130 407 52 370 217 334
## [325] 489 435 368 375 469 185 60 164 298 207 422 421 401 159 275 274 13 9
## [343] 195 203 152 477 76 99 191 137 454 114 132 171 27 322 269 267 372 141
## [361] 441 251 475 363 383 2 200 285 360 404 220 337 369 206 91 236 50 235
## [379] 245 32 192 20 157 158 231 107 410 51 42 277 40 313 45 22 29 202
## [397] 190 216 116 389 270 67 188 314
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
#Fitting the Linear Discriminant Analysis# First the linear discriminant analysis (LDA) is fitted to the train set. The new categorical variable crime is the target variable and all the other variables of the dataset are predictor variables. After fitting we draw the LDA biplot with arrows
#Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
#LDA = linear discriminant analysis
lda.fit <- lda(crime ~. , data = train)
#print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.230198 0.259901 0.250000 0.259901
##
## Group means:
## zn indus chas nox rm age
## low 0.9848748 -0.8913464 -0.060657012 -0.8858935 0.40613137 -0.9109656
## med_low -0.1295224 -0.2660168 -0.009855719 -0.5722028 -0.13731804 -0.3338593
## med_high -0.3921463 0.1682437 0.273407599 0.3225319 0.07957416 0.3923043
## high -0.4872402 1.0170492 -0.047351911 1.0594932 -0.49853894 0.8176379
## dis rad tax ptratio black lstat
## low 0.8919236 -0.6805530 -0.7041017 -0.5208337 0.37586058 -0.76197955
## med_low 0.3084028 -0.5673293 -0.5012264 -0.1136332 0.31406103 -0.11101676
## med_high -0.3394072 -0.4019522 -0.3079471 -0.2319080 0.04978242 0.01618115
## high -0.8576864 1.6388211 1.5145512 0.7815834 -0.79066183 0.88828778
## medv
## low 0.503955981
## med_low -0.004913207
## med_high 0.196452886
## high -0.679556208
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.133189090 0.83577575 -0.89778011
## indus -0.008191055 -0.18567308 0.32878705
## chas -0.030142652 -0.04563550 0.02965824
## nox 0.493766376 -0.62038787 -1.44830952
## rm 0.019536065 -0.06581342 -0.10889878
## age 0.229659624 -0.33824047 -0.20192232
## dis -0.098648715 -0.27627282 0.02513121
## rad 3.187163335 1.01390108 0.16479222
## tax -0.051651053 -0.12295072 0.40318771
## ptratio 0.180523597 -0.07025777 -0.35559703
## black -0.118739188 0.04921520 0.15369832
## lstat 0.223416705 -0.28177308 0.29015840
## medv 0.129746062 -0.48266575 -0.36170016
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9511 0.0368 0.0121
#the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#target classes as numeric
classes <- as.numeric(train$crime)
classes
## [1] 2 1 1 3 4 4 4 4 4 3 3 3 4 1 1 1 4 1 2 1 4 2 3 4 2 1 1 2 3 3 2 3 4 1 4 1 3
## [38] 3 1 3 2 1 2 1 3 4 2 4 2 2 3 2 2 2 4 4 4 4 4 4 4 3 3 3 1 4 4 3 3 3 2 3 2 4
## [75] 1 4 2 4 2 2 4 2 1 1 4 4 4 3 2 3 4 3 1 4 1 4 1 1 4 1 2 4 4 2 2 2 1 3 2 1 3
## [112] 2 1 1 3 2 4 4 2 4 2 3 4 3 2 4 3 2 4 4 2 1 2 1 2 3 1 2 1 1 3 2 3 2 4 3 1 3
## [149] 2 4 4 4 3 1 3 4 1 3 2 4 1 4 4 1 2 1 2 1 3 4 2 1 3 4 1 3 3 2 2 1 1 3 3 3 2
## [186] 1 3 1 3 2 2 2 1 2 4 2 1 2 2 4 4 4 1 3 3 2 2 2 3 4 1 1 3 3 2 1 3 1 3 4 4 4
## [223] 2 1 4 4 4 1 1 3 4 3 3 4 1 4 3 2 3 3 3 2 1 2 4 3 3 1 2 4 2 4 4 1 1 1 4 3 3
## [260] 2 2 3 4 1 3 2 4 3 3 1 4 4 4 3 2 1 3 3 1 4 1 1 2 3 1 2 1 4 2 2 4 2 4 4 1 1
## [297] 3 1 2 3 3 2 3 3 2 1 3 4 1 2 2 2 1 4 2 3 4 1 3 4 1 4 1 1 2 4 4 4 4 2 2 3 2
## [334] 2 4 4 4 3 1 2 2 2 1 1 3 4 2 1 2 3 4 2 3 3 3 2 3 3 4 3 4 2 4 4 4 1 1 1 4 4
## [371] 2 1 4 2 1 3 2 3 2 3 1 3 3 3 3 2 4 2 2 2 1 3 2 3 3 1 2 2 2 4 2 1 1 3
#plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
#Predicting the classes#
#Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results
#save the correct classes from test data
correct_classes <- test$crime
#remove the crime variable from test data
test <- dplyr::select(test, -crime)
#predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
#cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 11 4 0
## med_low 5 9 7 0
## med_high 0 5 19 1
## high 0 0 0 22
Prediction were quite good. There was some errors in the middle of the range but classes low and especially high were good. Only one correct representative of high class was predicted to med_low class.
#Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results
#Loading and scaling Boston data
scaled_Boston <- scale(Boston)
summary(scaled_Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
#calculating the euclidean distance matrix between the observation
dist_eu <- dist(scaled_Boston)
#determining the max number of clusters
cluster_max <- 15
#calculate the total within sum of squares
#K-means might produce different results every time, because it randomly
#assigns the initial cluster centers. The function set.seed() can be used to
#deal with that.
set.seed(123)
twcss <- sapply(1:cluster_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:cluster_max, twcss, type='b')
One way to determine the number of clusters is to look how the total of within cluster sum of squares (WCSS) behaves when the number of clusters changes. WCSS was calculated from 1 to 15 clusters. The optimal number of clusters is when the total WCSS drops radically. It seems that in this case optimal number of clusters is two. However we are here to learn so I decided to analyse model with four clusters.
After determining the number of clusters I run the K-means alcorithm again
#k-means clustering
km <-kmeans(dist_eu, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
It seems that when the data is divided to four clusters there is some clear differences in distriputions of several variables. Crim, zn, indus and blacks are variables were one can distinguish clear patterns between clusters. Some variables (rad & tax) show that sometimes 1 or 2 clusters make a clear distripution but observation of other two clusters are ambigious and there is no clear pattern to be regognised.
#BONUS: LDA using clusters as target classes#
#Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?
#Loading and scaling Boston data
scaled_Boston <- scale(Boston)
scaled_Boston <- as.data.frame(scaled_Boston)
#colnames(scaled_Boston)
#calculating the euclidean distance matrix between the observation
dist_eu <- dist(scaled_Boston)
#k-means clustering
set.seed(123)
km <-kmeans(dist_eu, centers = 4)
cm <- as.data.frame(km$cluster)
#adding the clusters to the scaled dataset
scaled_Boston <- data.frame(scaled_Boston, clust = cm)
colnames(scaled_Boston)[15] <- "clust"
summary(scaled_Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv clust
## Min. :-1.5296 Min. :-1.9063 Min. :1.000
## 1st Qu.:-0.7986 1st Qu.:-0.5989 1st Qu.:2.000
## Median :-0.1811 Median :-0.1449 Median :3.000
## Mean : 0.0000 Mean : 0.0000 Mean :2.943
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683 3rd Qu.:4.000
## Max. : 3.5453 Max. : 2.9865 Max. :4.000
#Original Boston dataset is now scaled and the result of K-means clustering is saved to the variable *clust*
#LDA = linear discriminant analysis
lda.fit.km <- lda(clust ~. , data = scaled_Boston)
#print the lda.fit object
lda.fit.km
## Call:
## lda(clust ~ ., data = scaled_Boston)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.1304348 0.2272727 0.2114625 0.4308300
##
## Group means:
## crim zn indus chas nox rm age
## 1 1.4330759 -0.4872402 1.0689719 0.4435073 1.3439101 -0.7461469 0.8575386
## 2 0.2797949 -0.4872402 1.1892663 -0.2723291 0.8998296 -0.2770011 0.7716696
## 3 -0.3912182 1.2671159 -0.8754697 0.5739635 -0.7359091 0.9938426 -0.6949417
## 4 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814 -0.3256000
## dis rad tax ptratio black lstat
## 1 -0.9620552 1.2941816 1.2970210 0.42015742 -1.65562038 1.1930953
## 2 -0.7723199 0.9006160 1.0311612 0.60093343 -0.01717546 0.6116223
## 3 0.7751031 -0.5965444 -0.6369476 -0.96586616 0.34190729 -0.8200275
## 4 0.3182404 -0.5741127 -0.6240070 0.02986213 0.34248644 -0.2813666
## medv
## 1 -0.81904111
## 2 -0.54636549
## 3 1.11919598
## 4 -0.01314324
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim -0.18113078 0.5012256 0.60535205
## zn -0.43297497 1.0486194 -0.67406151
## indus -1.37753200 -0.3016928 -1.07034034
## chas 0.04307937 0.7598229 0.22448239
## nox -1.04674638 0.3861005 0.33268952
## rm 0.14912869 0.1510367 -0.67942589
## age 0.09897424 -0.0523110 -0.26285587
## dis -0.13139210 0.1593367 0.03487882
## rad -0.65824136 -0.5189795 -0.48145070
## tax -0.28903561 0.5773959 -0.10350513
## ptratio -0.22236843 -0.1668597 0.09181715
## black 0.42730704 -0.5843973 -0.89869354
## lstat -0.24320629 0.6197780 0.01119242
## medv -0.21961575 0.9485829 0.17065360
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7596 0.1768 0.0636
#the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#target classes as numeric
classes <- as.numeric(scaled_Boston$clust)
#classes
#plot the lda results
plot(lda.fit.km, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit.km, myscale = 3)
#Super-bonus# 3D plot where observations color is the crime classes of the train set
model_predictors <- dplyr::select(train, -crime)
#check the dimensions
#dim(model_predictors)
#dim(lda.fit$scaling)
#matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#3d plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
3D plot where observations color is based on the K-means clusters
model_predictors <- dplyr::select(scaled_Boston, -clust)
#check the dimensions
#dim(model_predictors)
#dim(lda.fit.km$scaling)
#matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit.km$scaling
matrix_product <- as.data.frame(matrix_product)
#3D plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = scaled_Boston$clust)
Colors of the both plots is based to four classes. It seems that K-means plot shows the different clusters more clearly than the plot that is based on the crime classification.